ROCm and HIP: A Detailed 10-Chapter Tutorial: The Memory-Centric Nature of GPU Performance

In GPU acceleration, we must abandon the "compute first" mindset. Modern performance is determined by Memory Management: coordinating the allocation, synchronization, and optimization of data between the host (CPU) and the device (GPU).

1. The Gap Between Memory and Compute

While GPU arithmetic throughput (TFLOPS) has grown exponentially, memory bandwidth (GB/s) has grown much more slowly. This creates a gap in which the execution units are often "starved", sitting idle while they wait for data to arrive from VRAM. Consequently, GPU programming is frequently memory programming.

2. The Roofline Model

This model plots attainable performance against Arithmetic Intensity (FLOPs/Byte), the number of floating-point operations performed per byte moved to or from memory. Applications typically fall into one of two categories:

Memory-Bound: Limited by memory bandwidth (the sloped part of the roof).
Compute-Bound: Limited by peak TFLOPS (the horizontal ceiling).
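
The two categories above can be sketched as a small host-side calculation. This is a minimal illustration of the roofline formula, attainable performance = min(peak compute, arithmetic intensity × peak bandwidth); the function names are hypothetical and not part of ROCm or HIP, and any hardware figures passed in are the caller's assumptions.

```cpp
#include <algorithm>

// Attainable throughput in GFLOP/s under the roofline model:
// the lower of the compute ceiling and the bandwidth-limited slope.
double attainable_gflops(double peak_gflops, double peak_bw_gb_per_s,
                         double flops_per_byte) {
    return std::min(peak_gflops, flops_per_byte * peak_bw_gb_per_s);
}

// Ridge point: the arithmetic intensity (FLOPs/Byte) at which the
// sloped memory roof meets the horizontal compute ceiling.
double ridge_point(double peak_gflops, double peak_bw_gb_per_s) {
    return peak_gflops / peak_bw_gb_per_s;
}

// A kernel is memory-bound when its intensity lies left of the ridge point.
bool is_memory_bound(double peak_gflops, double peak_bw_gb_per_s,
                     double flops_per_byte) {
    return flops_per_byte < ridge_point(peak_gflops, peak_bw_gb_per_s);
}
```

As a worked example, a single-precision SAXPY (y = a*x + y) performs 2 FLOPs per element while moving 12 bytes (read x, read y, write y), an intensity of about 0.17 FLOPs/Byte. On a hypothetical device with 20,000 GFLOP/s peak compute and 1,000 GB/s peak bandwidth, the ridge point is 20 FLOPs/Byte, so the kernel is memory-bound and can attain roughly 167 GFLOP/s, under 1% of peak.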

3. The Data Movement Tax

The main performance bottleneck is rarely the math; it is the latency and energy cost of moving a byte across the PCIe bus or from HBM. High-performance code prioritizes data locality and minimizes transfers between host and device.
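
To make the cost of redundant transfers concrete, here is a rough sketch of a linear cost model (fixed latency plus bytes divided by bandwidth). The `Link` struct, the function names, and any bandwidth or latency figures are assumptions for illustration only; real transfer times depend on the platform and should be measured.

```cpp
// Hypothetical description of a data path (for example host-to-device over PCIe).
struct Link {
    double bandwidth_gb_per_s;  // sustained bandwidth in GB/s (1 GB = 1e9 bytes)
    double latency_s;           // fixed per-transfer overhead in seconds
};

// Linear cost model: fixed latency plus payload size divided by bandwidth.
double transfer_seconds(const Link& link, double bytes) {
    return link.latency_s + bytes / (link.bandwidth_gb_per_s * 1e9);
}

// Total copy time when the same buffer is re-sent on every iteration.
double copy_every_iteration(const Link& link, double bytes, int iterations) {
    return iterations * transfer_seconds(link, bytes);
}

// Total copy time when the buffer is sent once and stays resident on the device.
double copy_once(const Link& link, double bytes) {
    return transfer_seconds(link, bytes);
}
```

With an assumed 32 GB/s link and 10 microseconds of latency, re-sending a 1 GB buffer on each of 100 iterations costs about 3.1 s in copies alone under this model, versus about 0.031 s when the buffer is copied once and kept on the device.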


QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.